DLGRAFE-Net: A double loss guided residual attention and feature enhancement network for polyp segmentation

Colon polyps are a common gastrointestinal condition. To treat them effectively and prevent the complications they cause, colon polypectomy has become a commonly used therapeutic approach. Accurately segmenting polyps from colonoscopy images can provide valuable information for early diagnosis and treatment. Because of illumination and contrast variations, noise and artifacts, and variations in polyp size and blurred boundaries in polyp images, the robustness of segmentation algorithms is a significant concern. To address these issues, this paper proposes a Double Loss Guided Residual Attention and Feature Enhancement Network (DLGRAFE-Net) for polyp segmentation. Firstly, a newly designed Semantic and Spatial Information Aggregation (SSIA) module is used to extract and fuse edge information from low-level feature graphs and semantic information from high-level feature graphs, generating local loss-guided training for the segmentation network. Secondly, newly designed Deep Supervision Feature Fusion (DSFF) modules are utilized to fuse local loss feature graphs with multi-level features from the encoder, addressing the negative impact of background imbalance caused by varying polyp sizes. Finally, Efficient Feature Extraction (EFE) decoding modules are used to extract spatial information at different scales, establishing longer-distance spatial channel dependencies to enhance the overall network performance. Extensive experiments conducted on the CVC-ClinicDB and Kvasir-SEG datasets demonstrate that the proposed network outperforms the compared mainstream networks on all evaluation metrics and the compared state-of-the-art networks on DSC and IoU, exhibiting superior performance and stronger generalization capabilities.


Introduction
Deep learning (DL) has greatly improved the performance of automatic image segmentation in medical diagnosis. As a new research direction in the field of artificial intelligence, deep learning has been widely applied and researched in the field of medical image segmentation [1][2][3][4]. With the continuous advancement of AI, a series of new methods are emerging in the healthcare sector to improve diagnostic accuracy and efficiency. Currently, cancer is a prominent area of research due to its complexity, characterized by multiple genetic and epigenetic variations, and it ranks as the second leading cause of death globally [5,6]. By implementing appropriate prevention, early detection, and treatment strategies, approximately 3.7 million lives could be saved annually [7,8]. It is estimated that over one-third of cancer deaths can be prevented with timely interventions [9].
Colorectal cancer is a malignant tumor that originates in the colon or rectum, typically forming within the walls of the intestine [10,11]. This type of cancer tends to progress slowly, and initially there may be no apparent symptoms. However, the chances of cure are higher when it is detected early through endoscopic examination and treated promptly [12,13]. During endoscopy, the doctor may miss some potentially cancerous polyps because the polyps and the background have similar colors. To solve this problem, the use of computer-based deep learning to assist doctors in diagnosis has become particularly important.
With the continuous expansion and development of deep learning applications, an increasing number of deep learning-based segmentation methods have been proposed recently [14][15][16][17]. In cases where a doctor may have overlooked polyp regions, these segmentation methods can perform additional scans to guide the reanalysis of pathological information at that location. This, in turn, supports a more comprehensive assessment of the patient, facilitating more effective detection and management of potential precancerous lesions. While these methods have made progress, their segmentation accuracy is compromised because polyps exhibit low contrast and colors similar to the surrounding environment, making it challenging to determine the boundaries of polyp contours [4,18,19].
Inspired by the architectures of fully convolutional networks and ResNet, this paper introduces a Double Loss Guided Residual Attention and Feature Enhancement Network (DLGRAFE-Net) for polyp segmentation tasks [20,21]. The proposed network utilizes ResNet34 in the encoder to extract features, whereas the decoder utilizes the newly designed Efficient Feature Extraction (EFE) modules. Additionally, the newly designed Semantic and Spatial Information Aggregation (SSIA) module and Deep Supervision Feature Fusion (DSFF) modules are used to obtain a local loss for the network and to perform feature fusion.
The main contributions of this paper can be summarized as follows:

CNN-based polyp segmentation
ResNet (Residual Network) is a specialized architecture within Convolutional Neural Networks (CNNs) [20]. In contrast to typical CNN structures, ResNet introduces residual units with identity mappings [22]. In conventional deep neural networks, as the number of layers increases, issues such as vanishing or exploding gradients can arise, making it challenging for the network to converge. To address this problem, ResNet introduces the concept of residual units [4]. Residual units pass the shallow-layer input forward directly and let the deeper layers focus on learning the differences, transforming the learning problem into one of learning residuals [23]. This simplifies the network's learning process.
PraNet [24], proposed by Fan et al., is one of the most classic network structures in the field of polyp segmentation. To address the issue of unclear boundaries, these authors first utilize a Parallel Partial Decoder (PPD) to aggregate features from higher layers. Then, based on the combined features, they generate a global map as the initial guidance region for the subsequent components. Additionally, a Reverse Attention (RA) module is employed to explore boundary cues, establishing relationships between regions and boundary clues. Liu et al. proposed a coarse-to-fine segmentation framework for polyp segmentation, based on depth and classification features [25]. In order to improve the accuracy of polyp segmentation, the prediction graph of complex samples was used as prior information to guide the evolution of active contour models. ConvSegNet [26], proposed by Ige et al., introduces a novel Context Feature Refinement (CFR) module. This module extracts context information from the incoming feature map using parallel convolution layers with different kernel sizes. This enables the network to effectively identify and segment both small details and larger, more complex structures in the input images.
Recently, there has been a growing trend of proposing networks based on the Transformer architecture for medical image segmentation, following its introduction by Vaswani et al. in 2017 [27]. Based on the Transformer, Yang et al. proposed TranSEFusionNet to address the limitations of U-Net in medical image segmentation [28] and to reduce information loss during polyp image feature fusion. Liu et al. introduced ECTransNet [29] in 2023, incorporating an Edge Complementary module and utilizing the Transformer structure. This module effectively fuses differences between features with various resolutions, allowing the network to exchange features across different levels and significantly enhancing edge details in polyp segmentation. Furthermore, the authors employ a feature aggregation decoder, adaptively merging high-level and low-level features using residual blocks. This strategy preserves target spatial information in high-level features while restoring local edges in low-level features, ultimately improving segmentation accuracy [30]. However, when polyp images are analyzed by transforming them into word vectors, the Transformer faces a challenge: it may lose the original image's positional information. In tasks such as polyp segmentation, positional information is crucial for accurate analysis. Compared to fully convolutional networks, the Transformer architecture may not perform optimally in capturing local information in polyp images [31].

Datasets
For the polyp image segmentation task, each pixel in the training images is labeled as either polyp or non-polyp. DLGRAFE-Net was evaluated in experiments conducted on the Kvasir-SEG [32] dataset and the CVC-ClinicDB [33] dataset. The Kvasir-SEG dataset consists of 1000 polyp images along with their corresponding annotated maps. These images were annotated by expert endoscopists at Oslo University Hospital. The CVC-ClinicDB dataset comprises 612 polyp images. The training set used in the experiments was composed of 900 randomly selected images from Kvasir-SEG and 550 randomly selected images from CVC-ClinicDB, for a total of 1450 images. There were two test sets, composed of the remaining 100 images from Kvasir-SEG and the remaining 62 images from CVC-ClinicDB. No image appeared in both the training set and a test set.
The polyp images are encoded using a 7×7 convolution and the ResNet34 BasicBlock residual module (in a four-layer structure), chosen for its moderate depth and strong feature extraction capabilities, as well as for its residual connections, which ensure that gradients can be transmitted efficiently. A 7×7 convolution is used in order to capture large local features in the input images. Compared to smaller convolution kernels, 7×7 convolution kernels cover a larger area of the image in a single operation, which helps capture large-scale features. This is especially important in medical images because polyps vary widely in size and shape. A filter set of [32,64,128,256,512] is used, according to the hardware configuration. A newly designed SSIA module is utilized to gather richer spatial and semantic feature information, which is fused to extract local features, reconstruct mask prediction results, learn module weight parameters, and update gradients through a local loss function defined in (18). Newly designed DSFF modules perform feature fusion between local feature graphs and multi-level features from the encoder, focusing the network training on strongly relevant regions. These modules transfer the weights of local feature graphs to the encoder, supervise the encoder network training, and ultimately utilize a novel EFE-based decoder, which simultaneously considers global relationships and spatial details, to reconstruct higher-resolution segmentation results [34].
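The random train/test split described for the two datasets can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name `split_indices`, the use of integer indices in place of image files, and the fixed seed are all assumptions made for the example.

```python
import random

def split_indices(n_images, n_train, seed=0):
    """Randomly pick n_train image indices for training; the rest form the test set."""
    rng = random.Random(seed)  # the seed is an illustrative assumption
    indices = list(range(n_images))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# Dataset sizes and training counts as reported in the text.
kvasir_train, kvasir_test = split_indices(1000, 900)
clinic_train, clinic_test = split_indices(612, 550)

print(len(kvasir_train) + len(clinic_train))  # 1450 training images
print(len(kvasir_test), len(clinic_test))     # 100 62

# No image index appears in both the training and the test part of a dataset.
assert not set(kvasir_train) & set(kvasir_test)
assert not set(clinic_train) & set(clinic_test)
```

Splitting each dataset separately keeps one test set per dataset, matching the two test sets used in the experiments.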

Proposed network
The encoder module performs down-sampling on input images and extracts essential features. The encoder path is composed of a pre-trained ResNet34. Each residual block consists of two 3×3 convolutions with a stride of 2. Specifically, the four ResNet34 layers are made up of 3, 4, 6, and 3 residual blocks, respectively. The output E_j of the j-th residual block is produced as follows: where Conv3×3 denotes a 3×3 convolution with a stride of 2 and X denotes the current input to the convolutional layer.
The output SSIA_mask of the SSIA module is used for updating gradients, through a local loss function defined in (18), in order to concentrate local feature graph information on strongly relevant regions. The SSIA output is formed as follows: where Conv7×7 denotes a 7×7 two-dimensional (2D) convolution with a stride of 2, ρ denotes the feature fusion operation, and E_16 denotes the output of the 16-th residual block. Each DSFF module fuses SSIA_mask with the different-scale features of the encoder, enabling the network to capture long-range relationships while focusing the training on areas of strong interest [34]. Owing to the guidance of SSIA_mask, the encoder training can alleviate the negative effects of background imbalance caused by different polyp sizes. The output DSFF_j (j = 3, 7, 13, 16) of a DSFF module after feature fusion between the j-th residual block output and SSIA_mask is produced as follows: where E_j denotes the output of the j-th residual block. Unlike the encoder path, the decoder path is composed of a series of EFE modules for feature extraction. The output of each DSFF module is concatenated with the corresponding up-sampled EFE module output to further refine the output features of the module. An up-sampling unit with a scale of 2 is used to up-sample the feature graphs received from the lower network layer. The output D of the decoding phase is produced as follows: where up denotes the up-sampled output, EFE denotes the decoder, and concat denotes the operation of joining features of the same size together.
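The indices j = 3, 7, 13, 16 used for the DSFF inputs coincide with the last residual block of each of the four ResNet34 layers when the 16 blocks are numbered consecutively from 1. A short check of this arithmetic, with variable names chosen only for illustration:

```python
from itertools import accumulate

# Residual blocks per ResNet34 layer, as stated in the text.
blocks_per_layer = [3, 4, 6, 3]

# Index of the last residual block in each layer, counting blocks from 1.
last_block_index = list(accumulate(blocks_per_layer))

print(last_block_index)       # [3, 7, 13, 16]
print(sum(blocks_per_layer))  # 16
```

Under this reading, each DSFF module receives the final output of one encoder layer, and E_16 is the deepest encoder feature.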

SSIA module.
To accurately extract polyps from colonoscopy images, a newly designed SSIA module, depicted in Fig 2, is utilized, which takes advantage of the richer spatial information in low-level feature graphs and the more abundant semantic information in high-level feature graphs obtained from deep learning.
The SSIA module utilizes a convolutional structure to fuse low-level and high-level features, reconstructing mask prediction results. It combines deep coarse global features with shallow detailed global features to generate local feature graphs, reducing the aliasing effect caused by down-sampling. Simultaneously, the module learns weight parameters and updates gradients through a local loss function defined in (18). The SSIA module performs the following computations: where X_1 and X_2 denote the two input feature graphs, τ denotes an interpolation operation, concat denotes a channel concatenation operation, X′ denotes the intermediate output feature graph, X″ denotes the final output feature graph, Conv3×3 denotes a 2D convolution with a kernel size of 3, and Conv1×1 denotes a 2D convolution with a kernel size of 1.

DSFF modules.
In the convolutional feature extractor, newly designed DSFF modules are used to increase the receptive field of convolutional features. Guided by the local feature graph, each DSFF module extracts strongly relevant region features from multiple scales of the encoder, reduces weights in irrelevant regions, and alleviates the negative impact caused by the imbalance in polyp background. The use of a 1×1 2D convolution imparts non-linearity to the feature graphs, broadening the network's capabilities. This is why a "deep" network is often preferred over a "wide" network. Finally, the two processed feature graphs are fused by concatenation and refined further using two 3×3 convolutions. The DSFF structure is illustrated in Fig 3. Each DSFF module performs the following computations: where X_1 and X_2 denote the two input feature graphs, σ denotes the sigmoid operation, ⊙ denotes element-wise multiplication, X′ denotes the intermediate output feature graph, X″ denotes the final output feature graph, and Avgpool denotes the average pooling operation.

EFE modules.
The main function of the EFE decoding modules is to reconstruct higher-resolution segmentation results based on the spatial relationships extracted by the encoder and the semantic spatial features obtained from the convolutional branch. They achieve this through up-sampling, capturing different semantic features using a multi-scale channel and spatial attention mechanism. This mechanism is highly effective in capturing local features, allowing the network to disregard obvious global information during the decoding process, thereby emphasizing local complexity and highlighting the boundaries of the segmentation targets. The EFE structure is illustrated in Fig 4. Each EFE module performs the following computations: where X denotes the input feature graph, X_out denotes the output feature graph, Conv N×N denotes a 2D convolution with a kernel size of N, Channel denotes the operation performed by the channel submodule, and Spatial denotes the operation performed by the spatial submodule. The spatial submodule transforms various deformation data in space and automatically captures important regional features, whereas the channel submodule learns the importance of each channel through feature learning and then assigns a different weight to each channel.

Loss functions and experimental setup
The proposed DLGRAFE-Net network utilizes a combined BCE-Dice loss [35] in a global loss function, in order to provide finer-grained gradient information for training the whole network, along with improving its stability and sensitivity. The BCE-Dice loss combines the Binary Cross Entropy (BCE) loss [9] with the Dice loss [36], both of which are commonly used in binary segmentation tasks. The BCE loss measures the disparity between a network's output and the actual labels in binary classification problems. For each sample, the BCE loss computes the cross-entropy between the probability distribution predicted by the network and the actual labels, and then averages the losses across all samples. The Dice loss performs well in scenarios with severe imbalance between positive and negative samples, emphasizing foreground region exploration during network training. Utilizing these two loss functions in the global network can effectively assist in learning accurate segmentation. The BCE loss is defined in [9], as follows: where N denotes the number of pixels, q_i denotes the actual label of the i-th pixel (0 or 1), and p_i denotes the predicted probability that the i-th pixel belongs to class 1. The Dice loss is calculated, as per [37], as follows: where q_i denotes the target label of the i-th pixel, i.e., the binarized true label. The combined BCE-Dice loss is calculated as follows: where α denotes the weight factor, set to 0.5 based on experiments confirming that the network training reaches top performance when α = 0.5.
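The loss terms described above can be illustrated with a minimal sketch using the textbook forms of the BCE and Dice losses over N pixels. Several details are assumptions rather than the paper's definitions: the clamping constant `eps`, the Dice smoothing term `smooth`, and in particular the way the two terms are weighted, taken here as α·BCE + (1 − α)·Dice.

```python
import math

def bce_loss(p, q, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and labels q."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i = min(max(p_i, eps), 1.0 - eps)  # clamp to avoid log(0); eps is assumed
        total += q_i * math.log(p_i) + (1 - q_i) * math.log(1.0 - p_i)
    return -total / len(p)

def dice_loss(p, q, smooth=1.0):
    """One minus the soft Dice coefficient; `smooth` is an assumed stabiliser."""
    intersection = sum(p_i * q_i for p_i, q_i in zip(p, q))
    return 1.0 - (2.0 * intersection + smooth) / (sum(p) + sum(q) + smooth)

def bce_dice_loss(p, q, alpha=0.5):
    """Assumed combination: alpha * BCE + (1 - alpha) * Dice."""
    return alpha * bce_loss(p, q) + (1.0 - alpha) * dice_loss(p, q)

pred = [0.9, 0.8, 0.2, 0.1]  # predicted polyp probabilities for four pixels
mask = [1, 1, 0, 0]          # ground-truth labels for the same pixels
print(bce_dice_loss(pred, mask))
```

With α = 0.5 both terms contribute equally, so the pixel-wise BCE gradient and the region-level Dice overlap each shape the update.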
The global loss function (cf. Fig 1), used for training the proposed network, is defined as follows: In addition, a local loss function is applied to the local feature graph passing through the SSIA module of the proposed network (cf. Fig 1), as it can effectively emphasize the overlap area between the prediction results and the real labels in the encoding stage, handle the category imbalance between local features, and assist the global loss function in optimizing segmentation performance. The local loss function is defined as follows: The resulting double loss, cf. (17) and (18), can guide the proposed network to perform better in complex image segmentation tasks.
In the proposed network, the newly designed SSIA module is employed to integrate low-level semantic features extracted by the first convolutional layer (Conv7×7) with deep-level semantic features obtained from the ResNet34-based encoder. Subsequently, the loss defined in (18) is utilized as a local loss function, with a specific focus on addressing sample imbalance issues during the encoding stage. This approach ensures the comprehensive extraction of meaningful target feature information, guiding the decoding process effectively.
The Adam optimizer [38] is used. It adaptively adjusts the learning rate of each parameter based on its historical gradient information, by calculating first and second moment estimates of the gradient, which helps the network converge quickly and reduces the risk of becoming trapped in poor local minima during training.
The experiments were conducted on an Intel Core i5-12490 processor with a clock speed of 3.0 GHz and a single NVIDIA RTX3060 graphics card with 12 GB of memory. The hyperparameters for network training were set as follows: Batch_Size = 4, Epochs = 200 (validation was performed at each epoch, and the network was trained using the Adam optimizer), Initial_Learning_Rate = 1×10^−4, momentum = 0.9, Minimum_Learning_Rate = 1×10^−5. The network structure was implemented using PyTorch.
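To make the optimizer description concrete, the following sketch applies the standard Adam update to a single scalar parameter. It is not the training code used in the experiments. The learning rate 1×10^−4 and the first-moment coefficient 0.9 follow the reported hyperparameters (reading "momentum" as Adam's β1 is an assumption), while `beta2`, `eps`, and the toy objective f(θ) = θ² are assumed values chosen for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta^2 with gradient 2 * theta (assumed for the demo).
theta, m, v = 1.0, 0.0, 0.0
for step in range(1, 101):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, step)
print(theta)
```

Because the update divides by the square root of the second moment, the step size stays close to `lr` regardless of the raw gradient scale.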

Evaluation metrics
In order to objectively evaluate the network performance, training was conducted on the same dataset while keeping certain parameters constant. Common metrics, namely Dice Similarity Coefficient (DSC), precision, recall, and Intersection over Union (IoU), were used to evaluate the results. These metrics are defined as follows: where TP denotes the number of true positives, FP denotes the number of false positives, and FN denotes the number of false negatives. These metrics provide a comprehensive evaluation of the segmentation results, enabling a fair comparison of different networks on the same dataset.
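The four metrics can be computed from TP, FP, and FN using their standard definitions, as in the short sketch below. The helper names, the toy six-pixel example, and the convention of returning 0 for an empty denominator are illustrative assumptions.

```python
def confusion_counts(pred, mask):
    """Count TP, FP, FN over paired binary pixel labels (1 = polyp)."""
    tp = sum(1 for p, q in zip(pred, mask) if p == 1 and q == 1)
    fp = sum(1 for p, q in zip(pred, mask) if p == 1 and q == 0)
    fn = sum(1 for p, q in zip(pred, mask) if p == 0 and q == 1)
    return tp, fp, fn

def safe_div(num, den):
    """Return 0.0 when the denominator is zero (an assumed convention)."""
    return num / den if den else 0.0

def segmentation_metrics(tp, fp, fn):
    """Standard DSC, precision, recall, and IoU from pixel counts."""
    return {
        "dsc": safe_div(2 * tp, 2 * tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "iou": safe_div(tp, tp + fp + fn),
    }

pred = [1, 1, 1, 0, 0, 0]  # binarized prediction for six pixels
mask = [1, 1, 0, 1, 0, 0]  # ground truth: TP = 2, FP = 1, FN = 1
metrics = segmentation_metrics(*confusion_counts(pred, mask))
print(metrics)  # DSC, precision, and recall are 2/3; IoU is 0.5
```

DSC and IoU are monotonically related, DSC = 2·IoU / (1 + IoU), so they rank two segmentations of the same image identically even though their values differ.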

Performance comparison with classic segmentation networks
In this set of experiments, the segmentation performance of the proposed DLGRAFE-Net network was compared with that of classic segmentation networks, including U-Net [21], UNet++ [39], ResNet [20], and SegNet [40]. The obtained results, shown in Tables 1 and 2, demonstrate that the proposed network outperforms all other networks according to all evaluation metrics. More specifically, on the CVC-ClinicDB dataset, the first runner-up is outperformed by 6.80, 3.83, 7.43, and 10.39 percentage points, based on DSC, precision, recall, and IoU, respectively. On the Kvasir-SEG dataset, the first runner-up is outperformed by 3.42, 2.49, 3.51, and 4.99 percentage points, based on DSC, precision, recall, and IoU, respectively. The precision-recall curves, shown in Fig 5, further illustrate the superiority of the proposed DLGRAFE-Net network over the classic segmentation networks.
In order to further verify the generalization ability and robustness of the proposed network, we tested it, along with the other classic networks, on a previously unseen dataset, CVC-300 (containing 60 polyp images), which is different from the training set used. The obtained results, shown in Table 3, demonstrate that the proposed network outperforms all classic networks according to all evaluation metrics.
Additionally, it can be observed that when polyps are relatively small and their color is similar to the background color, the classic networks ResNet and SegNet lack global context information and the interaction of multi-scale features. Therefore, these networks may fail to accurately detect the location of polyps or even to recognize their existence. In most of the compared networks, boundary ambiguity and incomplete polyp segmentation appear in large-polyp segmentation. In the proposed network, a multi-scale spatial channel mechanism, provided by the EFE-based decoder, is used to capture global context information, and a cross-scale feature interaction strategy of the DSFF modules is used to integrate multi-stage features well, which allows the network to achieve good results in global and local feature extraction and recovery. According to the visual renderings, the proposed DLGRAFE-Net network achieves good segmentation results when the shape of the lesion area is irregular, the boundary is blurred, and the color is similar. Overall, DLGRAFE-Net proves to be more efficient in extracting detailed features of polyps, thereby achieving much better segmentation performance than the other networks compared.

Ablation studies
In order to assess the effectiveness of the different modules newly designed for DLGRAFE-Net, multiple ablation experiments were conducted on the Kvasir-SEG and CVC-ClinicDB datasets. The encoder of the baseline network is based on ResNet, whereas its decoder is based on U-Net. In these experiments, the newly designed EFE, SSIA, and DSFF modules were added sequentially, so that each step could be compared to the previous one. The obtained results are presented in Tables 4 and 5. As shown in Tables 4 and 5, adding the EFE modules to the baseline in the first step increased all evaluation metrics, except for precision, on both datasets. This indicates that the EFE-based decoder effectively utilizes multi-scale channel and spatial attention mechanisms to reconstruct different semantic features, preserving multi-scale information. It also better distinguishes features in different directions in the images and more effectively captures information in specific directions. Adding the SSIA module (without applying the local loss function) in the second step led to a further increase of all evaluation metrics, except for recall on the Kvasir-SEG dataset. This is because the boundaries of polyp regions in Kvasir-SEG are fuzzier than those in CVC-ClinicDB, so the rate of missed pixels increases, resulting in more false negatives (FN) and thus a lower recall. However, this relatively small drop in recall is an acceptable trade-off for the large improvement in all other metrics, which also suggests that the feature graphs locally generated by the SSIA module effectively reduce the aliasing effects caused by down-sampling. Applying the local loss function defined in (18) to the output of the SSIA module in the next step further improved all evaluation metrics, compared to the previous step. In the final two steps, with the inclusion of the DSFF modules (without and with the local loss applied to the SSIA output), the best values of all evaluation metrics were achieved at the latter step, except for recall on the Kvasir-SEG dataset. This indicates that, guided by the local feature graphs, the DSFF modules allow the network to effectively focus on target area features, reducing the weights of irrelevant areas and contributing to the improvement in segmentation performance.

Performance comparison with state-of-the-art segmentation networks
Finally, we compared the segmentation performance of the proposed DLGRAFE-Net network with that of state-of-the-art networks, based on their results reported in the corresponding literature sources. The results are shown in Tables 6 and 7 for the CVC-ClinicDB dataset and the Kvasir-SEG dataset, respectively. As can be seen from Table 6, DLGRAFE-Net outperforms all state-of-the-art networks on the CVC-ClinicDB dataset according to the two most important evaluation metrics in the field of image segmentation, namely DSC and IoU. Based on precision and recall, DLGRAFE-Net takes second place here. On the Kvasir-SEG dataset, the superiority of the proposed network is even more evident, as it outperforms all state-of-the-art networks according to three (out of four) evaluation metrics, including the two most important ones, i.e., DSC and IoU. Only on recall does DLGRAFE-Net take fourth place here.

Conclusion
A Double Loss Guided Residual Attention and Feature Enhancement Network (DLGRAFE-Net) has been proposed in this paper for polyp segmentation. Through an effective combination of residual networks and feature fusion modules, DLGRAFE-Net significantly enhances the feature fitting of neural networks, capturing the positional and shape edge features of polyps, thereby further improving segmentation performance, as evident from the experimental results obtained on two public datasets. Despite the success of DLGRAFE-Net in polyp segmentation, several issues remain unresolved. For instance, more effective preprocessing methods need to be developed, adopting targeted approaches such as removing artifacts and noise and performing image registration, which would contribute to enhancing the segmentation performance. Although the network proposed in this paper has significant advantages in terms of accuracy, reducing its computational complexity remains a challenge for future work. Addressing distribution differences between datasets is also a challenging problem worth further investigation. By tackling these issues, more reliable polyp segmentation can be achieved, which will have a positive impact on the field of medical image segmentation and on clinical applications.

Overall architecture.
Based on a fully convolutional network architecture, the proposed DLGRAFE-Net network includes three new types of modules, as illustrated in Fig 1.

Fig 4 .
Each EFE module performs the

Figs 6 and 7 display predicted images output by the different segmentation networks participating in this set of experiments. From these figures, it can be observed that the proposed DLGRAFE-Net network demonstrates more accurate segmentation of polyps. In comparison to other networks, DLGRAFE-Net excels in extracting features related to polyps and effectively mitigates the influence of similar background information around the polyps.